-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Using cached StoredFieldsReader for scroll query optimizations #20112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using cached StoredFieldsReader for scroll query optimizations #20112
Conversation
Signed-off-by: Prudhvi Godithi <[email protected]>
WalkthroughAdds per-segment StoredFieldsReader caching for scrolls: ScrollContext gains a cache, caching APIs, and close semantics; FetchPhase uses the cache for active scroll fetches; LegacyReaderContext registers scroll contexts for lifecycle cleanup. New unit and integration tests validate caching and several test classes receive codec-suppression annotations. Changes
Sequence DiagramsequenceDiagram
participant Client
participant FetchPhase
participant ScrollContext
participant StoredFieldsReader
Client->>FetchPhase: fetch request (scroll)
activate FetchPhase
FetchPhase->>FetchPhase: is scroll active?
alt scroll active
FetchPhase->>ScrollContext: getCachedSequentialReader(segmentKey)
activate ScrollContext
alt cached
ScrollContext-->>FetchPhase: cached StoredFieldsReader
else not cached
ScrollContext-->>FetchPhase: null
FetchPhase->>StoredFieldsReader: create sequential reader
activate StoredFieldsReader
StoredFieldsReader-->>FetchPhase: new reader
deactivate StoredFieldsReader
FetchPhase->>ScrollContext: cacheSequentialReader(segmentKey, reader)
end
deactivate ScrollContext
FetchPhase->>StoredFieldsReader: use reader to fetch stored fields
else not a scroll
FetchPhase->>StoredFieldsReader: get sequential reader (no cache)
StoredFieldsReader-->>FetchPhase: reader
end
FetchPhase-->>Client: return documents
deactivate FetchPhase
Client->>ScrollContext: close scroll
activate ScrollContext
ScrollContext->>StoredFieldsReader: close each cached reader
ScrollContext->>ScrollContext: clear cache
deactivate ScrollContext
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧠 Learnings (1)📚 Learning: 2025-12-02T22:44:14.761ZApplied to files:
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
🔇 Additional comments (3)
Comment |
|
{"run-benchmark-test": "id_3"} |
|
{"run-benchmark-test": "id_11"} |
|
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/5213/ . Final results will be published once the job is completed. |
|
❌ Gradle check result for 56dca5e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/5213/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/214/
|
Signed-off-by: Prudhvi Godithi <[email protected]>
|
{"run-benchmark-test": "id_3"} |
|
{"run-benchmark-test": "id_3"} |
|
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/5214/ . Final results will be published once the job is completed. |
|
❌ Gradle check result for c5e1e19: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/5214/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/215/
|
|
{"run-benchmark-test": "id_3"} |
|
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/5224/ . Final results will be published once the job is completed. |
|
{"run-benchmark-test": "id_11"} |
|
❌ Gradle check result for c5e1e19: TIMEOUT Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/5224/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/216/
|
|
{"run-benchmark-test": "id_11"} |
|
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/5226/ . Final results will be published once the job is completed. |
|
{"run-benchmark-test": "id_3"} |
|
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/5227/ . Final results will be published once the job is completed. |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/5226/
|
|
Hello! |
|
Benchmark shows ~19% improvement, very nice! |
|
❌ Gradle check result for 1ba5185: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Prudhvi Godithi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
server/src/internalClusterTest/java/org/opensearch/search/basic/TransportTwoNodesSearchIT.java (1)
84-92: Scope and assumptions behind@SuppressCodecs("*")The explanatory Javadoc is helpful, and aligning this IT with other scroll-related tests using
@LuceneTestCase.SuppressCodecs("*")makes the intent clear. Two points to tighten up:
- The comment asserts that different scroll batches may run on different threads “(but never concurrently)”. Since
StoredFieldsReaderis stateful, it would be good to explicitly treat this as an invariant you rely on (and ideally reference or enforce it nearScrollContext#getCachedSequentialReader), so future refactors don’t accidentally introduce concurrent use of the same cached reader.@LuceneTestCase.SuppressCodecsis documented as avoiding certain codec types; using"*"broadly disables randomized codecs for this class. Please double‑check against the Lucene version in this repo that"*"has the intended effect (primarily avoidingAssertingCodec/AssertingStoredFieldsFormatwhile still exercising a realistic production codec), and consider narrowing the suppression if feasible so you retain more codec coverage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
server/src/internalClusterTest/java/org/opensearch/search/basic/TransportTwoNodesSearchIT.java(2 hunks)server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java (1)
server/src/internalClusterTest/java/org/opensearch/search/basic/TransportTwoNodesSearchIT.java (1)
LuceneTestCase(92-460)
server/src/internalClusterTest/java/org/opensearch/search/basic/TransportTwoNodesSearchIT.java (3)
server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java (1)
LuceneTestCase(81-369)server/src/internalClusterTest/java/org/opensearch/search/scroll/ScrollStoredFieldsCacheIT.java (1)
LuceneTestCase(38-151)server/src/internalClusterTest/java/org/opensearch/search/scroll/SearchScrollIT.java (1)
LuceneTestCase(106-913)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: check-files
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: Analyze (java)
🔇 Additional comments (3)
server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java (2)
37-37: LGTM!The import is necessary for the
@LuceneTestCase.SuppressCodecsannotation used on the test class.
74-77: Verify thread safety guarantees for cross-thread StoredFieldsReader access.The javadoc states that "different batches may run on different threads (but never concurrently)." However, thread pool execution patterns don't inherently guarantee sequential access ordering or that concurrent access cannot occur under load. Given that Lucene maintainers have confirmed StoredFieldsReader is stateful and should not be shared across threads, verify that ScrollContext's caching implementation includes appropriate synchronization mechanisms (locks, synchronized blocks, or thread-local storage) to enforce the sequential cross-thread access guarantee.
server/src/internalClusterTest/java/org/opensearch/search/basic/TransportTwoNodesSearchIT.java (1)
37-37: Import forLuceneTestCaseis appropriateThe new import is required for the
@LuceneTestCase.SuppressCodecsannotation and is consistent with other scroll-related ITs using the same annotation.
server/src/internalClusterTest/java/org/opensearch/search/slice/SearchSliceIT.java
Show resolved
Hide resolved
|
❌ Gradle check result for 94d599a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
For #20112 (comment) I have a PR in Lucene apache/lucene#15468, Once agreed and released we can use |
|
❌ Gradle check result for 94d599a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
The The automation should create a flaky issue for |
|
❌ Gradle check result for 94d599a: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 94d599a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
❌ Gradle check result for 1cc0d36: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@msfroh @rishabhmaurya Just a gentle ping. |
|
Tagging @jainankitk, please take a look. This should be a good improvement and we can add it for 3.4.0 opensearch-project/opensearch-build#5764. |
msfroh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should be good.
From my understanding of LegacyReaderContext and ReaderContext, this will work with scrolls, but not with PITs, which should be fine. In general, I don't think we can guarantee that PITs will access documents sequentially.
Incidentally, this should yield a significant improvement for any subclass of AbstractBulkByScrollRequest -- that is update-by-query, delete-by-query, and reindex.
Can you try benchmarking a large reindex operation with and without this change? I'd be curious to see the impact. (I imagine it would be decent, depending on how much time is spent on the read side versus the write side of the reindex task.)
Thanks Froh. Make sense as if i'm not wrong |
00cf7c1
into
opensearch-project:main
* Scroll query optimizations Signed-off-by: Prudhvi Godithi <[email protected]> * spotless fix Signed-off-by: Prudhvi Godithi <[email protected]> * Scroll query optimizations Signed-off-by: Prudhvi Godithi <[email protected]> * spotless fix Signed-off-by: Prudhvi Godithi <[email protected]> * spotless fix Signed-off-by: Prudhvi Godithi <[email protected]> * Fix assertions Signed-off-by: Prudhvi Godithi <[email protected]> * Fix assertions Signed-off-by: Prudhvi Godithi <[email protected]> * Update tests Signed-off-by: Prudhvi Godithi <[email protected]> * update changelog Signed-off-by: Prudhvi Godithi <[email protected]> * Fix SearchSliceIT Signed-off-by: Prudhvi Godithi <[email protected]> * Fix RetryTests Signed-off-by: Prudhvi Godithi <[email protected]> --------- Signed-off-by: Prudhvi Godithi <[email protected]> (cherry picked from commit 00cf7c1) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
… (#20170) * Scroll query optimizations * spotless fix * Scroll query optimizations * spotless fix * spotless fix * Fix assertions * Fix assertions * Update tests * update changelog * Fix SearchSliceIT * Fix RetryTests --------- (cherry picked from commit 00cf7c1) Signed-off-by: Prudhvi Godithi <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…earch-project#20112) * Scroll query optimizations Signed-off-by: Prudhvi Godithi <[email protected]> * spotless fix Signed-off-by: Prudhvi Godithi <[email protected]> * Scroll query optimizations Signed-off-by: Prudhvi Godithi <[email protected]> * spotless fix Signed-off-by: Prudhvi Godithi <[email protected]> * spotless fix Signed-off-by: Prudhvi Godithi <[email protected]> * Fix assertions Signed-off-by: Prudhvi Godithi <[email protected]> * Fix assertions Signed-off-by: Prudhvi Godithi <[email protected]> * Update tests Signed-off-by: Prudhvi Godithi <[email protected]> * update changelog Signed-off-by: Prudhvi Godithi <[email protected]> * Fix SearchSliceIT Signed-off-by: Prudhvi Godithi <[email protected]> * Fix RetryTests Signed-off-by: Prudhvi Godithi <[email protected]> --------- Signed-off-by: Prudhvi Godithi <[email protected]>
Description
When testing the big5 scroll query, in the following flamegraph I can see the
Lucene90CompressingStoredFieldsReaderis called multiple times for all the scroll batches. The idea is cache the StoredFieldsReader in a private field within SequentialStoredFieldsLeafReader. The cached reader is reused across all subsequent calls to getSequentialStoredFieldsReader() for the lifetime of the SequentialStoredFieldsLeafReader instance.Related Issues
There is a regression issue reported in past #16262. This enhancement could help reduce the scroll API regressions.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
Bug Fixes
New Features / Performance
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.